ABSTRACT

Complete knowledge of all direct and indirect inter- 
actions between proteins in a given cell would 
represent an important milestone towards a com- 
prehensive description of cellular mechanisms and 
functions. Although this goal is still elusive, considerable
progress has been made, particularly for
certain model organisms and functional systems. 
Currently, protein interactions and associations are 
annotated at various levels of detail in online 
resources, ranging from raw data repositories to 
highly formalized pathway databases. For many 
applications, a global view of all the available inter- 
action data is desirable, including lower-quality data 
and/or computational predictions. The STRING 
database (http://string-db.org/) aims to provide 
such a global perspective for as many organisms 
as feasible. Known and predicted associations are 
scored and integrated, resulting in comprehensive 
protein networks covering >1100 organisms. Here, 
we describe the update to version 9.1 of STRING, 
introducing several improvements: (i) we extend 
the automated mining of scientific texts for inter- 
action information, to now also include full-text 
articles; (ii) we entirely re-designed the algorithm 
for transferring interactions from one model 
organism to the other; and (iii) we provide users 
with statistical information on any functional enrich- 
ment observed in their networks. 



INTRODUCTION 

Highly complex organisms and behaviors can arise from a 
surprisingly restricted set of existing gene families (1,2), by
a tightly regulated network of interactions among the 
proteins encoded by the genes. This functional web of 
protein-protein links extends well beyond direct physical
interactions only; indeed, physical interactions might also 
be rather limited, covering perhaps <1% of the theoretically
possible interaction space (3). Proteins do not necessarily
need to undergo a stable physical interaction to have
a specific, functional interplay: they can catalyze subse- 
quent reactions in a metabolic pathway, regulate each 
other transcriptionally or post-transcriptionally, or 
jointly contribute to larger, structural assemblies without 
ever making direct contact. Together with direct, physical 
interactions, such indirect interactions constitute the 
larger superset of 'functional protein-protein associations' 
or 'functional protein linkages' (4,5). 

Protein-protein associations have proven to be a useful 
concept, by which to group and organize all protein- 
coding genes in a genome. The complete set of associ- 
ations can be assembled into a large network, which 
captures the current knowledge on the functional modu- 
larity and interconnectivity in the cell. Apart from ad hoc 
use — i.e. by browsing networks for genes of interest, 
inspecting interaction evidence or performing interactive 
clustering — a variety of systematic and large-scale 
usage scenarios for functional association networks have 
emerged. For example, (i) association networks have been 
frequently used to interpret the results of genome-wide 
genetic screens, in particular RNAi perturbation screens 
(6-9). Because such screens can be noisy and difficult to
interpret, any protein-network information that may help
to connect potential hits can serve to provide additional
confidence, particularly if a number of hits can be 
observed in a densely connected functional module in 
the network. (ii) Protein network information can aid in
the interpretation of functional genomics data, e.g. in sys- 
tematic proteomics surveys (10-12). This is particularly 
useful when the proteomics data themselves contain a 
protein-protein association component, such as in 
MS-based interaction discovery or in large-scale
enzyme/substrate analysis. (iii) Protein association networks have
also proven surprisingly useful for the elucidation of 
disease genes, both for Mendelian and for complex 
diseases (13-15). For the latter application, the networks 
can help to constrain the search space: genomic regions
encompassing more than one candidate gene, or lists of
genes observed to be mutated in sequencing studies, can be 
filtered for those genes that have connections to known 
disease genes (or for genes having above-random connect- 
ivity among themselves). 

The STRING database has been designed with the goal 
to assemble, evaluate and disseminate protein-protein as- 
sociation information, in a user-friendly and comprehen- 
sive manner. As interactions between proteins represent 
such a crucial component for modern biology, STRING 
is by no means the only online resource dedicated to this
topic. Apart from the primary databases that hold the 
experimental data in this field (16-20) and hand-curated 
databases serving expert annotations (21,22), a number 
of resources take a meta-analysis approach, similar to 
STRING. These include GeneMANIA (23), Consensus- 
PathDB (24), I2D (25), VisANT (26) and, more recently, 
hPRINT (27), HitPredict (28), IMID (29) and IMP (30). 
Within this wide variety of online resources and databases 
dedicated to interactions, STRING specializes in three 
ways: (i) it provides uniquely comprehensive coverage, 
with >1000 organisms, 5 million proteins and >200
million interactions stored; (ii) it is one of very few sites
to hold experimental, predicted and transferred inter- 
actions, together with interactions obtained through text 
mining; and (iii) it includes a wealth of accessory informa- 
tion, such as protein domains and protein structures, im- 
proving its day-to-day value for users. 

We have already discussed many aspects of the 
STRING resource previously, e.g. (31,32), including its 
data-sources, prediction algorithms and user-interface. 
Here, we describe the current update to version 9.1 of 
the resource, focusing on new features and updated algo- 
rithms. In particular, we will describe how STRING in- 
creasingly makes use of externally provided orthology 
information [from the eggNOG database (33)] to better 
integrate evidence across distinct organisms. 



UPDATED TEXT MINING

The new version of STRING features a redesigned text-mining
pipeline. We have improved the named entity recognition
engine to use custom-made hashing and string-compare
functions to comprehensively and efficiently handle
orthographic variation related to whether a name is written
as one word, two words or with a hyphen. As in the previous
versions of STRING, associations between proteins are
derived from statistical analysis of co-occurrence in
documents and from natural language processing. The latter
combines part-of-speech tagging, semantic tagging and a
chunking grammar to achieve rule-based extraction of
physical and regulatory interactions, as described
previously (34).

To improve the quality and number of links derived
from co-occurrence, we have developed an entirely new
scoring scheme, which takes into account co-occurrences
within sentences, within paragraphs and within whole
documents and combines them through an optimized
weighting scheme.

The scoring scheme first calculates a weighted count
($C_{ij}$) for each pair of entities $i$ and $j$:

$$C_{ij} = \sum_{k=1}^{n} \left( w_d \delta_{dijk} + w_p \delta_{pijk} + w_s \delta_{sijk} \right)$$

where $w_d = 1$, $w_p = 2$ and $w_s = 0.2$ are the weights for
co-occurrence within the same document, same paragraph
and same sentence, respectively. The delta functions
$\delta_{dijk}$, $\delta_{pijk}$ and $\delta_{sijk}$ are 1 if the entities $i$
and $j$ are co-mentioned in the document $k$, in a paragraph
of $k$ or in a sentence of $k$, respectively, and 0 otherwise.
Based on the weighted counts, the co-occurrence score
($S_{ij}$) is defined as:

$$S_{ij} = C_{ij}^{\alpha} \left( \frac{C_{ij} C_{\cdot\cdot}}{C_{i\cdot} C_{\cdot j}} \right)^{1-\alpha}$$

where $C_{i\cdot}$ and $C_{\cdot j}$ are the sums over all pairs involving
$i$ or $j$ and an entity from the same taxon, $C_{\cdot\cdot}$ is the sum
over all pairs of entities from the taxon, and $\alpha = 0.6$. The
parameters were optimized on the KEGG benchmark set.

This has substantially improved the quality and number
of associations extracted (Table 1). The more efficient
named entity recognition engine and the new scoring
scheme also enabled us to move beyond the parsing of
MEDLINE abstracts, and to now include text mining of
1 821 983 full-text articles, which were freely available
from publishers' web sites. This has further improved the
comprehensiveness of the text mining in the new version of
STRING (Table 1). The natural language processing part
of the pipeline has also been standardized, to make use of
an ontology that describes possible molecular modes of
action by which proteins can influence each other (35).
Finally, the new text-mining pipeline explicitly takes into
account orthology information by treating each orthologous
group as an entity that is considered whenever one of
its member proteins is mentioned (33), thereby directly
detecting associations between orthologous groups as
well as between proteins.
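
The co-occurrence scoring used in the text-mining pipeline can be illustrated with a minimal Python sketch. It assumes that each document has already been tagged and is represented as a list of paragraphs, each a list of sentences, each a set of entity names from a single taxon. The weights (1, 2 and 0.2) and the exponent 0.6 are the values quoted in the text; the functional form of the score, the data layout and the function names `weighted_counts` and `cooccurrence_scores` are assumptions made for illustration and are not taken from the STRING code base.

```python
from collections import defaultdict
from itertools import combinations

# Weights quoted in the text: same document, same paragraph, same sentence.
W_DOC, W_PAR, W_SEN = 1.0, 2.0, 0.2
ALPHA = 0.6  # exponent quoted in the text


def weighted_counts(documents):
    """Weighted co-mention count for every entity pair.

    `documents` is a list of documents; a document is a list of paragraphs;
    a paragraph is a list of sentences; a sentence is a set of entity names.
    (Hypothetical input layout, chosen for this sketch.)
    """
    counts = defaultdict(float)
    for doc in documents:
        sen_pairs, par_pairs, doc_entities = set(), set(), set()
        for paragraph in doc:
            par_entities = set()
            for sentence in paragraph:
                sen_pairs.update(combinations(sorted(sentence), 2))
                par_entities |= sentence
            par_pairs.update(combinations(sorted(par_entities), 2))
            doc_entities |= par_entities
        # Each document contributes at most once per co-mention level.
        for pair in combinations(sorted(doc_entities), 2):
            counts[pair] += (W_DOC
                             + W_PAR * (pair in par_pairs)
                             + W_SEN * (pair in sen_pairs))
    return dict(counts)


def cooccurrence_scores(counts, alpha=ALPHA):
    """Score = C_ij^alpha * (C_ij * C_all / (C_i * C_j))^(1 - alpha).

    Assumed form: the observed count is combined with an
    observed-over-expected ratio built from the marginal sums.
    """
    total = sum(counts.values())
    marginal = defaultdict(float)
    for (i, j), c in counts.items():
        marginal[i] += c
        marginal[j] += c
    return {
        (i, j): c ** alpha * (c * total / (marginal[i] * marginal[j])) ** (1 - alpha)
        for (i, j), c in counts.items()
    }


# Toy corpus: two documents mentioning the hypothetical entities A, B and C.
corpus = [
    [[{"A", "B"}], [{"C"}]],   # A and B share a sentence; C is elsewhere
    [[{"A"}, {"B"}]],          # A and B share a paragraph only
]
counts = weighted_counts(corpus)
scores = cooccurrence_scores(counts)
```

In this toy corpus the pair (A, B) accumulates weight from both documents, whereas (A, C) and (B, C) are only co-mentioned at the document level, so (A, B) receives the highest score.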



Table 1. Protein-protein associations based on automated text mining

                                   STRING v9.0   STRING v9.1   Fold increase
Natural language processing             38 859        63 331           1.629
Co-occurrence, high confidence         286 880       792 730           2.763
Co-occurrence, medium confidence     1 100 756     1 672 222           1.519
Co-occurrence, low confidence        3 214 754     4 270 322           1.328

This table quantifies non-redundant associations extracted by text mining in STRING, at various confidence levels; note that both STRING versions
shown here are based on the same set of organisms and proteins. The increase in text-mining interactions is largest in the high confidence bracket,
reflecting the increased performance enabled by the extension to full text articles, and by the improved entity recognition engine.



TRANSFER OF INTERACTIONS BETWEEN
ORGANISMS

Evolutionarily related proteins are known to usually
maintain their three-dimensional structure, even when they
have become so diverged over time that there is hardly
any detectable sequence similarity left between them
(36,37). Similarly, most protein-protein interaction
interfaces remain well-conserved over time, at least for the case
of stably bound protein partners located next to each 
other in protein complexes (38,39). This means that a 
pair of proteins observed to be stably binding in one 
organism can be expected to be binding in another 
organism as well, provided both genes have been 
retained in both genomes. The term 'interologs' was 
coined for such pairs, a combination of the words 'inter- 
action' and 'ortholog' (40). Whether this high degree of 
interaction conservation is true also for other, more 
indirect or transient types of protein-protein associations 
is less clear — although at least one such type, namely joint 
metabolic pathway membership, has also been shown to 
be generally well-conserved (41,42). Based on the principle 
of interaction conservation, evidence transfer from one 
model organism to the other seems feasible, and it has 
been implemented in several frameworks already. 

In practice, the search for potential interologs is not 
trivial, except for very closely related organisms. The 
reason for this lies in the high frequency of gene duplica- 
tions, gene losses and gene re-arrangements, which makes 
it difficult to assign pairs of functionally equivalent genes 
across distant organisms. The best candidates for func- 
tionally equivalent genes in two organisms are 'one-to- 
one' orthologs, i.e. genes that track back to a single gene 
in the last common ancestor of both organisms, and 
have since undergone little or no duplication or loss
events (43-45). In a large resource such as STRING,
unequivocally identifying one-to-one orthologs for all pairs
of organisms is not feasible: there are potentially more 
than a million pairs of organisms to study, each with thou- 
sands of genes, and the proper identification of orthologs 
would ideally entail exhaustive and time-consuming 
phylogenetic tree analysis. In the past, STRING has there- 
fore used two distinct heuristic options: either to substitute 
homology for orthology (46) or to use pre-defined 
orthology relations described at high-level taxonomic 
groups, from the COG database (47). We found that 
both approaches were suboptimal; they both transferred 
evidence even when the presence of multiple paralogs 
indicated that the orthology situation was somewhat 
unclear — despite an explicit procedure to down-weigh 
the transferred scores in such cases, at least in the 
homology approach (46). We have, therefore, now 
devised a procedure that more explicitly considers the 
known phylogeny of organisms and which works on 
the basis of hierarchical orthologous groups maintained 
at the eggNOG database (33). 



The taxonomy tree covering the 1133 species present in 
STRING consists of 495 branching nodes at different 
taxonomic positions (the tree is a down-sampled version 
of the taxonomy maintained at NCBI). Through experi- 
mentation and benchmarking, we have developed a new 
two-step procedure, which makes use of this tree for the 
transfer of functional associations. First, associations 
between proteins are transferred to the orthologous 
groups to which the proteins belong; this proceeds sequen- 
tially from lower to increasingly higher levels of taxo- 
nomic hierarchy. Second, associations are transferred in 
the opposite direction, i.e. from the orthologous groups 
back to their constituent proteins. Where available, the 
hierarchical orthology groups from eggNOG version 3 
are used (33). As many of the taxonomic positions in the 
tree are not covered in eggNOG, we construct provisional 
groups for the missing positions by down-sampling the
orthologous groups from the next higher taxonomy level 
present in eggNOG. 

To compute a score of functional association ($S_{abk}$)
between two orthologous groups $a$ and $b$ at the taxonomic
level $k$, we sort the $n$ associations ($r_{a_i b_i}$) between their
member proteins from highest to lowest score, and then
integrate them sequentially (Figure 1):

$$S_{abk} = 1 - (1 - p') \prod_{i=1}^{n} \frac{1 - r_{a_i b_i} f_{a_i b_i} \min_j d_{ij}}{1 - p'}$$

where $p'$ is the prior probability of two proteins being linked,
which is 0.063 according to the KEGG benchmark set; $f_{a_i b_i}$
is a penalty dependent on the number of paralogs of a
given protein pair and $d_{ij}$ is a penalty dependent on the
similarity of the species $i$ and the other species $j$ that have
already been included in the score:

$$f_{a_i b_i} = (c_{a_i} c_{b_i})^{-\alpha}, \qquad d_{ij} = \frac{1}{1 + \exp[-\beta(\delta - s_{ij})]}$$

where $c_{a_i}$ and $c_{b_i}$ are the number of proteins from a given
species in the orthologous groups, and $s_{ij}$ is the median
similarity between the given species, measured on a universal
set of marker gene families (48) and expressed as the
'self-normalized bit-score' (i.e. the bit score of an
alignment between two proteins, which is divided by the bit
score of a self-alignment of the shorter of the two
proteins; this measure always ranges from zero to one).
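
The sequential integration of protein-level links into a score between two orthologous groups can be sketched in Python as follows. This is an illustration under several assumptions rather than the STRING implementation: the paralog penalty is taken to be a negative power of the product of paralog counts, the species penalty is taken to be a logistic function of the self-normalized bit score, and each down-weighted score is clamped to the prior so that the result stays between the prior and one. The prior of 0.063 is the value quoted in the text; the parameter values passed as `alpha`, `beta` and `delta`, and all function names, are hypothetical.

```python
import math

PRIOR = 0.063  # prior probability of a link, as quoted in the text


def self_normalized_bit_score(alignment_bits, self_bits_of_shorter):
    """Bit score of an alignment divided by the self-alignment bit score
    of the shorter of the two proteins."""
    return alignment_bits / self_bits_of_shorter


def paralog_penalty(c_a, c_b, alpha):
    """Assumed form: shrinks as the number of same-species members grows."""
    return (c_a * c_b) ** (-alpha)


def species_penalty(similarity, beta, delta):
    """Assumed logistic form: close to 0 for very similar species,
    close to 1 for dissimilar ones."""
    return 1.0 / (1.0 + math.exp(-beta * (delta - similarity)))


def group_score(links, similarity, alpha, beta, delta, prior=PRIOR):
    """Integrate links between two orthologous groups, strongest first.

    `links` is a list of (score, species, c_a, c_b) tuples and
    `similarity(species_i, species_j)` returns a value in [0, 1].
    """
    remaining = 1.0
    seen = []
    for score, species, c_a, c_b in sorted(links, key=lambda l: l[0], reverse=True):
        f = paralog_penalty(c_a, c_b, alpha)
        # The first link has no previously included species: no penalty.
        d = min((species_penalty(similarity(species, other), beta, delta)
                 for other in seen), default=1.0)
        weighted = max(score * f * d, prior)  # clamping is an assumption
        remaining *= (1.0 - weighted) / (1.0 - prior)
        seen.append(species)
    return 1.0 - (1.0 - prior) * remaining


# Hypothetical parameter values, chosen only to make the example concrete.
params = dict(alpha=0.5, beta=10.0, delta=0.5)
single = group_score([(0.8, "sp1", 1, 1)], lambda a, b: 1.0, **params)
close = group_score([(0.8, "sp1", 1, 1), (0.7, "sp2", 1, 1)],
                    lambda a, b: 1.0, **params)
distant = group_score([(0.8, "sp1", 1, 1), (0.7, "sp2", 1, 1)],
                      lambda a, b: 0.0, **params)
```

With these settings a second link from a nearly identical species adds almost nothing to the group score, whereas the same link from a distant species raises it, which mirrors the down-weighting of redundant evidence described in the text.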

[Figure 1 graphic. Left panel labels: 'Step 2: Transferring orthologous group links back to the protein level'; highest, medium and low level of taxonomy; legend: target organism, proteins from different species, link between proteins, orthologous group, link between orth. groups, target proteins, transferred link. Right panel: 'transfers to Homo sapiens'; x-axis: false positives.]

Figure 1. Improved procedure for interaction transfer between organisms. Left: steps 1 and 2 of the functional association transfer pipeline. In the
first step, the individual links between proteins are combined into a score between orthologous groups, sequentially, from the strongest link (thick
line) to the weakest (thin). Each subsequent score is down-weighted, both based on the similarity of its organism to organisms that have already
contributed to the combined scores, and on number of proteins from the same organism inside the orthologous group. In the second step of the
transfer pipeline, the links between orthologous groups are transferred back to individual protein pairs belonging to these groups. This is done
sequentially from the lowest to highest taxonomy level. In the above example, the two transferred links from the highest taxonomic level (orange
links) are penalized for the increase in number of proteins from the target species in one of the orthologous groups. Right: ROC curves indicating the
performance of predicted interolog scores, benchmarked against KEGG pathways; an inferred link between two proteins is considered to be a true
positive when both proteins are annotated to be together in at least one shared KEGG pathway.

The process is repeated for all pairs of orthologous
groups at every taxonomic level. Next, the scores
between pairs of orthologous groups are transferred
back to protein pairs; this finally results in the actual
evidence transfer between organisms. To calculate the
transferred score ($T_{a_i b_i}$) from all taxonomic levels $m$ to a
protein pair from species $i$, we combine the scores ($S_{abk}$)
from orthologous groups consecutively from the lowest to
the highest taxonomy level, subtracting the contributions
from all lower taxonomic levels (Figure 1):

$$T_{a_i b_i} = 1 - (1 - p') \prod_{k=1}^{m} \frac{1 - S_{abk} f_{abk}^{\varepsilon} (s_a s_b)^{\gamma}}{(1 - T_{a_i b_i, k-1})(1 - r_{a_i b_i})(1 - p')}$$
where at each taxonomic level, we subtract the part of the
score that originates from the species itself ($r_{a_i b_i}$) while
additionally penalizing it for the number of paralogs in
the respective orthologous groups ($f_{abk}$) and for the
median self-normalized bit scores ($s_a$ and $s_b$) of the
proteins in the groups $a$ and $b$.

The parameters $\alpha$, $\varepsilon$ and $\gamma$ are universal in the sense that
they have the same values for all evidence channels in
STRING, e.g. co-occurrence, experiments and text
mining, whereas $\beta$ and $\delta$ are channel specific to take into
account the different rate at which scores become inde- 
pendent from each other. The new transfer scheme was 
optimized and benchmarked on the set of known inter- 
actions in the KEGG database and achieves better per- 
formance than the previous method, both for orthologous 
groups and for individual proteins (Figure 1). 



STATISTICAL ENRICHMENT ANALYSIS 

STRING users who do not just query with a single protein
of interest, but instead upload entire lists of proteins, are
often interested in knowing whether their input shows 
evidence for a statistical enrichment of any known biolo- 
gical function or pathway. To address this question, a 
variety of dedicated online resources are already available 
(49,50), most notably the DAVID resource (51). However, 
entering gene lists at multiple websites can be cumbersome,
and not all existing resources will make full use of
the latest protein network information. Therefore, we
have now included functionality to detect enrichment of
functional systems in each currently displayed network in
STRING, testing a number of functional annotation
spaces including Gene Ontology, KEGG, Pfam and
InterPro (see Figure 2). Any detected enrichments can
be browsed interactively, visually highlighting the
corresponding proteins in the network (Figure 2).

In the Enrichment widget, STRING displays every 
functional pathway/term that can be associated to at 
least one protein in the network. The terms are sorted 
by their enrichment P-value, which we compute using a
hypergeometric test, as explained in (53). The P-values
are corrected for multiple testing using the method of 
Benjamini and Hochberg (54), but we also provide 
options to either disable that correction or to select a 
more stringent statistical test (Bonferroni). In the case of 
testing for Gene Ontology enrichments, users have the 
additional options to exclude annotations inferred by 
automatic procedures only (Inferred from Electronic
Annotation, IEA), to limit the testing to pre-defined higher
level categories (GO Slim), or to prune away parent 
terms that are redundant with child terms (i.e. covering 
the exact same set of proteins). 
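
As an illustration of the statistics described above, the following Python sketch computes a one-sided hypergeometric P-value for each annotation term and then applies either the Benjamini-Hochberg or the Bonferroni correction. It uses only the standard library. The function names, the input layout (a dictionary mapping each term to its annotated proteins) and the choice of background set are assumptions made for this example; the sketch also assumes that every network protein is part of the background.

```python
from math import comb


def hypergeom_pvalue(hits, sample_size, annotated, population):
    """P(X >= hits) when `sample_size` proteins are drawn without replacement
    from `population` proteins, `annotated` of which carry the term."""
    upper = min(sample_size, annotated)
    favourable = sum(
        comb(annotated, i) * comb(population - annotated, sample_size - i)
        for i in range(hits, upper + 1)
    )
    return favourable / comb(population, sample_size)


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):          # walk from the largest P-value down
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def bonferroni(pvalues):
    """Bonferroni adjusted P-values (the more stringent option)."""
    return [min(1.0, p * len(pvalues)) for p in pvalues]


def enrichment(network, background, annotations):
    """Test every term that annotates at least one network protein.

    Returns (term, P-value, BH-adjusted P-value) tuples sorted by P-value.
    """
    network, background = set(network), set(background)
    rows = []
    for term, members in annotations.items():
        members = set(members) & background
        hits = len(members & network)
        if hits == 0:
            continue
        p = hypergeom_pvalue(hits, len(network), len(members), len(background))
        rows.append((term, p))
    adjusted = benjamini_hochberg([p for _, p in rows])
    return sorted(((t, p, q) for (t, p), q in zip(rows, adjusted)),
                  key=lambda row: row[1])


# Toy example with hypothetical protein identifiers P1..P10.
background = [f"P{i}" for i in range(1, 11)]
result = enrichment(
    network=["P1", "P2", "P3"],
    background=background,
    annotations={"repair": ["P1", "P2", "P4", "P5"], "other": ["P9"]},
)
```

Terms without any protein in the network are skipped before the multiple-testing correction, matching the statement that only terms associated with at least one network protein are displayed.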

Furthermore, we report to the user whether the protein
list is enriched in STRING interactions per se, independent
of known pathway annotations. The latter functionality
is non-trivial and requires an explicit null model, owing
to the non-uniform distribution of the connectivity
degrees of proteins in networks (9,55-57). We chose a
random background model that preserves the degree
distribution of the proteins in a given list: the Random
Graph with Given Degree Sequence (RGGDS), similar
to references (55,57).

Figure 2. Network visualization and statistical analysis of a user-supplied protein list. The STRING screenshot shows a user-supplied set of genes,
here a selection of cancer genes as annotated at the COSMIC database (52). The set is restricted to those genes that are known to pre-dispose to
cancer already when mutated in the germline, and that have at least one connection in STRING. The inset illustrates the website's new functionality
for automatically detecting statistically enriched functions or processes in a network. In this example, one of the detected processes (nucleotide
excision repair) is of interest and has been selected; STRING automatically highlighted the corresponding nodes in the network, where they are seen
to form a densely connected module.

Given a list $L$ of proteins, let $X_L$ denote the number of
edges connecting proteins in an RGGDS with similar size
as $L$. For the given $L$, a strong edge enrichment
corresponds to a low probability of counting, in the
RGGDS, at least the observed number $x$ of edges
connecting proteins in $L$, i.e. a low value of

$$S_L(x) = P(X_L \geq x)$$

The random variable $X_L$ is a sum of Bernoulli variables
with distinct parameters, and hence a Poisson-Binomial
variable. If $L$ is large, $X_L$ can thus be approximated by a
Poisson random variable, whose cumulative probability
function is:

$$S_L(x) = P(X_L \geq x) \approx \sum_{n=x}^{\infty} \frac{e^{-\lambda} \lambda^{n}}{n!}$$

where the Poisson parameter $\lambda$ is obtained by summing, over
the pairs of proteins $u, v \in L$, a term that depends on the
degree product $\deg(u)\deg(v)$ and on $M$, with $M$ being the
total number of interactions within $L$ in
STRING, and $\deg(v)$ denoting the degree of protein $v$, i.e.
the number of interaction partners it has.
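
The Poisson approximation for the edge-enrichment test can be sketched in Python as below. The upper-tail sum is standard; the way the Poisson mean is derived from the node degrees is an assumption made for this example, namely that two proteins u and v are connected with probability deg(u)deg(v)/(2M) in the random graph, and the exact expression used by STRING may differ. The function names are hypothetical.

```python
import math


def poisson_tail(x, lam, tol=1e-16):
    """P(X >= x) for X ~ Poisson(lam), summed directly over the upper tail."""
    if x <= 0:
        return 1.0
    if lam <= 0:
        return 0.0
    # First tail term, computed in log space to avoid overflow.
    term = math.exp(-lam + x * math.log(lam) - math.lgamma(x + 1))
    total = 0.0
    n = x
    while True:
        total += term
        n += 1
        term *= lam / n
        # Past the mode the terms shrink monotonically; stop when negligible.
        if n > lam and term <= tol * total:
            break
    return min(1.0, total)


def expected_edges(degrees, total_edges):
    """Expected number of edges among the listed proteins.

    Assumption: edge probability deg(u) * deg(v) / (2 * M) for each
    unordered pair, with M = `total_edges`.
    """
    degs = list(degrees.values())
    pair_sum = (sum(degs) ** 2 - sum(d * d for d in degs)) / 2.0
    return pair_sum / (2.0 * total_edges)


def edge_enrichment_pvalue(observed_edges, degrees, total_edges):
    """Probability of seeing at least `observed_edges` edges by chance."""
    return poisson_tail(observed_edges, expected_edges(degrees, total_edges))


# Toy example: three proteins with hypothetical degrees, M = 10 interactions.
toy_degrees = {"a": 2, "b": 3, "c": 4}
lam = expected_edges(toy_degrees, total_edges=10)
p_enriched = edge_enrichment_pvalue(3, toy_degrees, total_edges=10)
```

Summing the tail directly, rather than computing one minus the lower tail, keeps small P-values from being lost to floating-point cancellation.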



USER INTERFACE 

The STRING website aims to provide easy and intuitive 
interfaces for searching and browsing the protein inter- 
action data, as well as for inspecting the underlying 
evidence. Users can query for a single protein of interest, 
or for a set of proteins, using a variety of different iden- 
tifier name spaces. The resulting network can then be in- 
spected, rearranged interactively or clustered at variable 
stringency. Each protein node in the network shows a 
preview of 3D structural information, if available, and
can be clicked to reveal a pop-up window with more
information about the protein [including its annotation (58),
SMART domain-structure (59), structure homology 
models from SWISS-MODEL Repository (60), etc.]. 
Each edge in the network denotes a known or predicted 
interaction, and leads to a pop-up window providing 
details on the underlying evidence and the interaction con- 
fidence scores. 

An important new feature in version 9.1 of STRING is 
the possibility for users to identify themselves by logging
in. Although this is not necessary for basic browsing and 
searching, it provides users with the option to browse their 
history of past searches, save visited pages for later return 
and upload lists of proteins that are of interest to them. In
addition, logging in is useful for storing and retrieving 
'payload' information to be shown and browsed alongside 
the network. As described previously (31), 'payload' infor- 
mation is user-provided extra data that can be projected 
onto the STRING network; it can consist of information 
regarding both nodes (proteins) and edges (interactions). 
Previously, any payload information had to be
communicated to STRING via a set of files following a
specific format; now, it can be uploaded and managed
interactively.